Evaluating natural language systems: a sourcebook approach

Authors

  • Walter Read
  • Alex Quilici
  • John Reeves
  • Michael G. Dyer
  • Eva Baker
Abstract

This paper reports progress in the development of evaluation methodologies for natural language systems. Without a common classification of the problems in natural language understanding, authors have no way to specify clearly what their systems do, potential users have no way to compare different systems, and researchers have no way to judge the advantages or disadvantages of different approaches to developing systems.

Introduction

Recent years have seen a proliferation of natural language systems. These include both applied systems, such as database front-ends, expert system interfaces, and on-line help systems, and research systems developed to test particular theories of language processing. Each system comes with a set of claims about what types of problems the system can "handle". But what does "handles ellipsis" or "resolves anaphoric reference" actually mean? All and any such cases? Certain types? And what classification of 'types' of ellipsis is the author using? Without a common classification of the problems in natural language understanding, authors have no way to specify clearly what their systems do, potential users have no way to compare different systems, and researchers have no way to judge the advantages or disadvantages of different approaches to developing systems. While these problems have been noted over the last 10 years (Woods, 1977; Tennant, 1979), research developing specific criteria for the evaluation of natural language systems has appeared only recently. This paper reports progress in the development of evaluation methodologies for natural language systems. This work is part of the Artificial Intelligence Measurement System (AIMS) project of the Center for the Study of Evaluation at UCLA. The AIMS project is developing evaluation criteria for expert systems, vision systems, and natural language systems.

Previous Work on Natural Language Evaluation
Woods (1977) discussed a number of dimensions along which progress in the development of natural language systems can be measured. In particular, he considered approaches via a "taxonomy of linguistic phenomena" covered, the convenience and perspicuity of the model used, and the time used in processing. As Woods points out, the difficulty of a taxonomic approach is that the taxonomy will always be incomplete. Any particular phenomenon will have many subclasses, and it often turns out that the published examples cover only a small part of the problem. A system might claim "handles pronoun reference" when the examples cover only parallel constructions. To make such a taxonomy useful we have to identify as many subclasses as possible. On the positive side, if we can build such a taxonomy, it will allow authors to state clearly just what phenomena they are making claims about. It could serve not only as a description of what has been achieved but as a guide to what still needs to be done. Woods provides a useful discussion of the difficulties involved in each of these approaches but offers no specific evaluative criteria. He draws attention to the great effort involved in doing evaluation by any of these methods and to the importance of a "detailed case-by-case analysis". Our present work is an implementation and extension of some of these ideas. Tennant and others (Tennant, 1979; Finin, Goodman & Tennant, 1979) make a distinction between conceptual coverage and linguistic coverage of a natural language system and argue that systems have to be measured on each of these dimensions. Conceptual coverage refers to the range of concepts handled by the system, and linguistic coverage to the range of language used to discuss the concepts.

*The work reported here is part of the Artificial Intelligence Measurement Systems (AIMS) Project, which is supported in part by ONR contract number N00014-86-K-0395.
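The taxonomy-driven evaluation Woods describes can be made concrete as a coverage checklist: each linguistic phenomenon is broken into subclasses, and a system's claims are scored against the subclasses its published examples actually exercise. A minimal sketch follows; the phenomena, subclasses, and scoring rule here are illustrative assumptions, not the paper's classification:

```python
# Illustrative sketch of a taxonomy-based coverage report.
# The taxonomy fragment below is hypothetical, not taken from the paper.
TAXONOMY = {
    "pronoun reference": [
        "parallel construction",
        "cross-sentence antecedent",
        "plural/split antecedent",
    ],
    "ellipsis": [
        "verb phrase ellipsis",
        "noun phrase ellipsis",
        "gapping",
    ],
}

def coverage_report(claimed, demonstrated):
    """For each claimed phenomenon, report how many taxonomy subclasses
    the system's published examples actually cover."""
    report = {}
    for phenomenon in claimed:
        subclasses = TAXONOMY.get(phenomenon, [])
        shown = demonstrated.get(phenomenon, [])
        covered = [s for s in subclasses if s in shown]
        report[phenomenon] = (len(covered), len(subclasses), covered)
    return report

# A system that claims "handles pronoun reference" but whose examples
# show only parallel constructions covers 1 of 3 subclasses.
claims = ["pronoun reference"]
examples = {"pronoun reference": ["parallel construction"]}
for phen, (n, total, covered) in coverage_report(claims, examples).items():
    print(f"{phen}: {n}/{total} subclasses covered: {covered}")
```

Such a report makes a claim like "handles pronoun reference" auditable: it states exactly which subclasses were demonstrated and which remain open, which is the dual role (description plus research agenda) the taxonomy is meant to serve.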
Tennant suggests a possible experimental separation between conceptual and linguistic coverage. The distinction these authors make is important and useful, in part for emphasizing the significance of the knowledge base for the usability of a natural language system. But the examples that Tennant gives for conceptual completeness (presupposition, reference to discourse objects) seem to be

Similar resources

Reuse and Challenges in Evaluating Language Generation Systems: Position Paper

Although there is an increasing shift towards evaluating Natural Language Generation (NLG) systems, there are still many NLG-specific open issues that hinder effective comparative and quantitative evaluation in this field. The paper starts off by describing a task-based, i.e., black-box evaluation of a hypertext NLG system. Then we examine the problem of glass-box, i.e., module specific, evalua...


SoLDES: Service-oriented Lexical Database Exploitation System

In this work, we focus on the assisted exploitation of lexical databases designed according to the LMF standard (Lexical Markup Framework) ISO-24613. The proposed system is a service-oriented solution which relies on a requirement-based lexical web service generation approach that expedites the task of engineers when developing NLP (Natural Language Processing) systems. Using this approach, t...


Response Quality Evaluation in Heterogeneous Question Answering System: A Black-box Approach

The evaluation of question answering systems is a major research area that needs much attention. Before the rise of domain-oriented question answering systems based on natural language understanding and reasoning, evaluation was never a problem, as information retrieval-based metrics were readily available for use. However, when question answering systems began to be more domain-specific, eval...


Sequence to Sequence Modeling for User Simulation in Dialog Systems

User simulators are a principal offline method for training and evaluating human-computer dialog systems. In this paper, we examine simple sequence-to-sequence neural network architectures for training end-to-end, natural language to natural language, user simulators, using only raw logs of previous interactions without any additional human labelling. We compare the neural network-based simulat...


A Black-box Approach for Response Quality Evaluation of Conversational Agent Systems

The evaluation of conversational agent or chatterbot question answering systems is a major research area that needs much attention. Before the rise of domain-oriented conversational agents based on natural language understanding and reasoning, evaluation was never a problem, as information retrieval-based metrics were readily available for use. However, when chatterbots began to become more doma...




Publication date: 1988